With this script LiP MS data from the MaxQuant software can be checked for quality. The only input needed is the txt folder of the MaxQuant output, the proteome as a FASTA file from Uniprot, and a mapping file with information about the experiment. The tidyverse package is mainly used.

Some concepts and ideas are taken from the PTXQC package, however the code was completely written by me and is considerably faster than PTXQC.

Preparation

Load packages

tidyverse: The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
janitor: Usefull for making column names more R friendly.
DT: Data table to create interactive tables for markdown.
data.table: Package similar to dplyr, faster but less intuitive.
phylotools: Has read.fasta function to import fasta files to R.
ggrepel: This package allows the addition of improved labels to ggplot.
pheatmap: Makes pretty and easy heatmaps.
dendsort: Sort the dendrogram created for heatmap.
plotly: Creates interactive figures from ggplot.
naniar: Useful for analysis of missing values

library(tidyverse)
library(janitor)
library(DT)
library(data.table)
library(ggrepel)
library(pheatmap)
library(dendsort)
library(phylotools)
library(plotly)
library(naniar)

Load datasets

Read in data and make them more R friendly with clean_names(). This converts column names to snake_case.

parameters.txt: File genereated by MaxQuant. Contains information about search and run parameters.
peptides.txt: File genereated by MaxQuant. The peptides table contains information on the identified peptides in the processed raw-files.
evidence.txt: File genereated by MaxQuant. The evidence file combines all the information about the identified peptides and normally is the only file required for processing the results. Additional information about the peptides, modifications, proteins, etc. can be found in the other files by unique identifier linkage.
summary.txt: File generated by MaxQuant. The summary file contains summary information for all the raw files processed with a single MaxQuant run. The summary information consists of some MaxQuant parameters, information of the raw file contents, and statistics on the peak detection. Based on this file a quick overview can be gathered on the quality of the data in the raw file.
msmsScans.txt: File generated by MaxQuant. Contains information about fragmentation and MS/MS fragment ions.
*_proteome.fasta: File containing informaition about every protein in the desired proteome. Contains Uniprot ID’s and sequence.
map.txt: File containing information about the experiment. Generated by the user.

Structure of the map.txt file:

  • sample_number: 1, 2, 3…
  • strain: E. coli (K12)…
  • biological_replicate: 1, 2, 3…
  • technical_replicate: 1, 2, 3..
  • pipline: LiP, Trypsin Digest
  • ms_running_order: 1, 2, 3…
  • vial_position: A1, A2, A3…
  • filename: Name of the file that was the input for MaxQuant.Take from summary file.
  • experiment: Abbreviated name of the sample. Specified in MaxQuant as “Experiment”. Take from summary file. E.g.: Control_01, Control_02, Treated_01…
  • replicate_group: One number for each replicate group. control, control, control, control, treated, treated, treated, treated…
fasta <- read.fasta(file.choose("."))%>%clean_names()
proteome <- fasta %>%
  mutate(uniprot_id = str_extract(seq_name, "\\|.*?\\|")%>% str_extract(.,"(\\w){1,}"))%>%
  select(-seq_name)
msmsscans <- fread("msmsScans.txt", na.strings = c("NaN")) %>% clean_names()
parameters <- fread("parameters.txt", na.strings = c("NaN")) %>% clean_names()
peptides <- fread("peptides.txt", na.strings = c("NaN")) %>% clean_names()
evidence <- fread("evidence.txt", na.strings = c("NaN"), colClasses = c(Intensity = "numeric")) %>% clean_names()
summary <- fread("summary.txt", na.strings = c("NaN")) %>% clean_names()
map <- fread("map.txt", na.strings = c("NaN")) %>% clean_names()

Description of variables:

Here we define which columns we want to keep for the further analysis of the parameters, summary, evidence and peptides dataframes.

Parameters

Version: MaxQuant version the search was performed with.
User name: User that performed the search.
Machine name: Computer the search carried out on.
Date of writing: The date the MaxQuant search was performed.
Min. peptide Length: The minimum peptide length.
Decoy mode: Specifies how decoys are created.
Fasta file: The reference database used.

Evidence

Sequence: The amino acid sequence of the identified peptide.
Modified sequence: Sequence representation including the post-translational modifications (abbreviation of the modification in brackets before the modified AA). The sequence is always surrounded by underscore characters (’_‘).
Modifications: Post-translational modifications contained within the identified peptide sequence.
Length: The length of the sequence stored in the column ’Sequence’.
Missed cleavages: Number of missed enzymatic cleavages.
Leading razor protein: The identifier of the best scoring protein, from the proteinGroups file this peptide is associated to.
Type: The type of the feature. ‘MSMS’ for an MS/MS spectrum without an MS1 isotope pattern assigned. ‘ISO-MSMS’ MS1 isotope cluster identified by MS/MS. ‘MULTI-MSMS’ MS1 labeling cluster identified by MS/MS. ‘MULTI-SECPEP’ MS1 labeling cluster identified by MS/MS as second peptide. ‘MULTI-MATCH’ MS1 labeling cluster identified by matching between runs. In case of label-free data there is no difference between ‘MULTI’ and ‘ISO’.
Experiment: The name of each sample. Specified in MaxQuant by the user.
Reverse: When marked with ‘+’, this particular peptide was found to be part of a protein derived from the reversed part of the decoy database. These should be removed for further data analysis.
Potential contaminant: When marked with ‘+’, this particular peptide was found to be part of a commonly occurring contaminant. These should be removed for further data analysis.
Charge: The charge-state of the precursor ion.
mass_error_ppm: Mass error of the recalibrated mass-over-charge value of the precursor ion in comparison to the predicted monoisotopic mass of the identified peptide sequence in parts per million.
uncalibrated_mass_error_ppm: Mass error of the uncalibrated mass-over-charge value of the precursor ion in comparison to the predicted monoisotopic mass of the identified peptide sequence.
retention_time: The uncalibrated retention time in minutes in the elution profile of the precursor ion.
retention_length: The total retention time length of the peak (last time point first time point).
match_time_difference: When the option match between runs is used in MaxQuant, this value indicates the time difference between the feature from the raw file it was taken from and the feature from the raw file it was matched to.
fraction: Column appears if fractions were selected in MQ. This is useful if sample was fractionated or if match between runs should not be performed between all samples.

Peptides

Sequence: The amino acid sequence of the identified peptide.
N-term cleavage window: Sequence window from -15 to 15 around the N-terminal cleavage site of this peptide.
C-term cleavage window: Sequence window from -15 to 15 around the C-terminal cleavage site of this peptide.
Amino acid before: The amino acid in the protein sequence before the peptide.
First amino acid: The amino acid in the first position of the peptide sequence.
Last amino acid: The amino acid in the last position of the peptide sequence.
Amino acid after: The amino acid in the protein sequence after the peptide.
A Count: The number of instances of the ‘A’ amino acid contained within the sequence.
R Count: The number of instances of the ‘R’ amino acid contained within the sequence.
N Count: The number of instances of the ‘N’ amino acid contained within the sequence.
D Count: The number of instances of the ‘D’ amino acid contained within the sequence.
C Count: The number of instances of the ‘C’ amino acid contained within the sequence.
Q Count: The number of instances of the ‘Q’ amino acid contained within the sequence.
E Count: The number of instances of the ‘E’ amino acid contained within the sequence.
G Count: The number of instances of the ‘G’ amino acid contained within the sequence.
H Count: The number of instances of the ‘H’ amino acid contained within the sequence.
I Count: The number of instances of the ‘I’ amino acid contained within the sequence.
L Count: The number of instances of the ‘L’ amino acid contained within the sequence.
K Count: The number of instances of the ‘K’ amino acid contained within the sequence.
M Count: The number of instances of the ‘M’ amino acid contained within the sequence.
F Count: The number of instances of the ‘F’ amino acid contained within the sequence.
P Count: The number of instances of the ‘P’ amino acid contained within the sequence.
S Count: The number of instances of the ‘S’ amino acid contained within the sequence.
T Count: The number of instances of the ‘T’ amino acid contained within the sequence.
W Count: The number of instances of the ‘W’ amino acid contained within the sequence.
Y Count: The number of instances of the ‘Y’ amino acid contained within the sequence.
V Count: The number of instances of the ‘V’ amino acid contained within the sequence.
Start position: Position of the first amino acid of this peptide in the protein sequence. (one-based)
End position: Position of the last amino acid of this peptide in the protein sequence. (one-based)

Summary

Experiment: The name of each sample. Specified in MaxQuant by the user.
ms: The number of MS spectra recorded in this raw file.
ms_ms: The number of MS/MS spectra recorded in this raw file.
ms_ms_submitted: The number of tandem MS spectra submitted for analysis.
ms_ms_identified_percent: The percentage of identified tandem MS spectra.
peptide_sequences_identified: The total number of unique peptide amino acid sequences identified from the recorded tandem mass spectra.
peaks: The total number of peaks detected in the full scans.
peaks_sequenced: The total number of peaks sequenced by tandem MS.
peaks_sequenced_percent: The percentage of peaks sequenced by tandem MS.
peaks_repeatedly_sequenced: The total number of peaks repeatedly sequenced (i.e. 1 or more times) by tandem MS.
peaks_repeatedly_sequenced_percent: The percentage of peaks repeatedly sequenced (i.e. 1 or more times) by tandem MS.
isotope_patterns: The total number of detected isotope patterns.
isotope_patterns_sequenced: The total number of isotope patterns sequenced by tandem MS.
isotope_patterns_repeatedly_sequenced: The total number of isotope patterns repeatedly sequenced (i.e. 1 or more times) by tandem MS.
isotope_patterns_repeatedly_sequenced_percent: The percentage of isotope patterns repeatedly sequenced (i.e. 1 or more times) by tandem MS.
av_absolute_mass_deviation_ppm:The average absolute mass deviation found comparing to the identification mass in parts per million. Only including MUILTI-MSMS points.
mass_standard_deviation_ppm: The standard deviation of the mass deviation found comparing to the identification mass in parts per million. Only including MUILTI-MSMS points.

row_parameters <- c("Version", "User name", "Machine name", "Date of writing", "Min. peptide Length", "Decoy mode", "Fasta file")

columns_evidence <- c("sequence", "modified_sequence", "modifications", "length", "missed_cleavages", "leading_razor_protein", "type", "experiment",  "intensity", "reverse", "potential_contaminant", "charge", "mass_error_ppm", "uncalibrated_mass_error_ppm", "retention_time", "retention_length", "calibrated_retention_time", "ms_ms_count", "match_time_difference")

columns_peptides <- c("sequence", "n_term_cleavage_window", "c_term_cleavage_window", "amino_acid_before", "first_amino_acid", "last_amino_acid", "amino_acid_after", "a_count", "r_count", "n_count", "q_count", "g_count", "h_count", "i_count", "l_count", "k_count", "m_count", "f_count", "p_count", "s_count", "t_count", "w_count", "y_count", "v_count", "c_count", "d_count", "e_count", "start_position", "end_position")

columns_summary <- c("experiment", "ms", "ms_ms", "ms_ms_submitted", "ms_ms_identified", "ms_ms_identified_percent", "peptide_sequences_identified", "peaks", "peaks_sequenced", "peaks_sequenced_percent", "peaks_repeatedly_sequenced", "peaks_repeatedly_sequenced_percent", "isotope_patterns", "isotope_patterns_sequenced", "isotope_patterns_repeatedly_sequenced", "isotope_patterns_repeatedly_sequenced_percent", "av_absolute_mass_deviation_ppm", "mass_standard_deviation_ppm")

Joining datasets

The peptides, evidence and map datasets are joined together to create one main dataset with all the information about the experiment. In addition only variables that are of further use will be retained. The new dataset is called complete.

The relevant variables from the summary dataset are selected and stored in sample_summary.

complete <- evidence %>%
  select(all_of(columns_evidence), one_of("fraction"))%>%
  left_join((select(peptides, all_of(columns_peptides))), by = "sequence") %>%
  left_join(map, by = "experiment")%>%
  mutate(N_term_tryp = ifelse(amino_acid_before == "-" | amino_acid_before == "K" | amino_acid_before == "R", TRUE, FALSE ))%>%
  mutate(C_term_tryp = ifelse(last_amino_acid == "K" | last_amino_acid == "R" | last_amino_acid == "-", TRUE, FALSE ))%>%
  mutate(pep_type = fct_relevel(factor(ifelse(N_term_tryp + C_term_tryp == 2, "Fully-tryptic", ifelse(N_term_tryp + C_term_tryp == 1,"Semi-tryptic", "Non-tryptic"))), "Non-tryptic", after = Inf))

sample_summary <- summary %>%
  select(all_of(columns_summary))%>%
  filter(experiment != "")
# determination of figure size variables. Can be ignored and has nothing to do with analysis
n_samples <- n_distinct(summary$experiment)-1
n_conditions <- n_distinct(map$replicate_group)
n_cols <- if_else(n_samples >16, 1, 2)
n_cols_topN <- case_when(n_samples/4 <=1 ~ 2, n_samples/4 >1 & n_samples/4 <2 ~ 3, n_samples/4 >=2 ~ 4)
MBR<- "MULTI-MATCH" %in% complete$type
tryptic <- "Trypsin Digest" %in% complete$pipeline
lip <- "LiP" %in% complete$pipeline
fractions <- "fraction" %in% names(complete)
#n_fractions <- n_distinct(summary$fraction)-1

Filter datasets to remove decoys and contaminants

complete_clean or complete_clean_contaminants: Complete dataset without decoys and potential contaminants or just without decoys.

complete_clean <- complete %>%
  filter(reverse != "+", potential_contaminant != "+")%>%
  select(-reverse, -potential_contaminant)

complete_clean_contaminants <- complete %>%
  filter(reverse != "+") %>%
  select(-reverse)

Quality Control

MaxQuant search parameters

The most interesting parameters of the MaxQuant search are displayed.

parameters %>% 
  filter(parameter %in% row_parameters)%>%
  knitr::kable()
parameter value
Version 1.6.8.0
User name jquast
Machine name RDSNX-SH02
Date of writing 11/05/2019 21:36:12
Min. peptide Length 6
Fasta file E:_temp_190829.fasta
Decoy mode revert

Summary of run statisitics

MaxQuant computes summaries for each sample and saves them in the summary file. Some of these are displayed here.

datatable(sample_summary, rownames = FALSE, extensions = 'FixedColumns',
          options = list(
            dom = 'til',
            pageLength = 8,
            lengthMenu = c(4, 8, 16, 32, 50, 100),
            scrollX = TRUE,
            fixedColumns = list(leftColumns = 1),
            scrollCollapse = TRUE
          ))

Overview summary plots

Plot for each paramter for easier visual identification of differences.

factor_levels_summary <- str_replace_all(str_replace_all(str_replace_all(as.character(columns_summary), fixed("_"), " "), fixed("percent"), "[%]"), fixed("ppm"), "[ppm]")

sample_summary %>%
  gather(title, value, -experiment)%>%
  mutate(title = str_replace_all(str_replace_all(str_replace_all(as.character(title), fixed("_"), " "), fixed("percent"), "[%]"), fixed("ppm"), "[ppm]"))%>%
  mutate(title = factor(title, levels = factor_levels_summary))%>%
  ggplot(aes(paste(str_to_title(str_replace_all(as.character(experiment), fixed("_"), " "))), value))+
  geom_col(fill = "steelblue1", col ='black')+
  labs(x= "", y="")+
  facet_wrap(~title, scales = "free", ncol=n_cols)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1, size =8))+
  geom_text(aes(label=round(value, digits=2)), position = position_stack(vjust = 0.5), size = 3)

RSD of median run intensities

RSD <- complete %>%
  filter(!is.na(intensity))%>%
  group_by(experiment)%>%
  summarize(median = median(intensity))%>%
  mutate(RSD_median = sd(median)/mean(median)*100, mean_median = mean(median), sd_median = sd(median))

RSD_lip <- complete %>%
  filter(!is.na(intensity), pipeline == "LiP")%>%
  group_by(experiment)%>%
  summarize(median = median(intensity))%>%
  mutate(RSD_median = sd(median)/mean(median)*100, mean_median = mean(median), sd_median = sd(median))

if(nrow(RSD_lip) == 0){RSD_lip[1, ] <- NA}

RSD_trypsin <- complete %>%
  filter(!is.na(intensity), pipeline == "Trypsin Digest")%>%
  group_by(experiment)%>%
  summarize(median = median(intensity))%>%
  mutate(RSD_median = sd(median)/mean(median)*100, mean_median = mean(median), sd_median = sd(median))

if(nrow(RSD_trypsin) == 0){RSD_trypsin[1, ] <- NA}

The relative standard deviation of the median run intensities gives an estimate of how similar the runs are to eachother. Ideally the value is low, which indicates a low standard deviation compared to the mean. The unit of RSD is percent. The RSD should not be calculated on log transformed values as it will result in an underestimation. There is a corrected formular for this case though.

LiP

RSD of median intensities: 11.27%
Mean of median intensities: 25469687
SD of median intensities: 2870036

Tryptic digest

RSD of median intensities: 8.58%
Mean of median intensities: 24064125
SD of median intensities: 2064072

medians_plot <- RSD%>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), median, group = 1))+
  geom_line(size = 1)+
  labs(title = "Medians of run intensities", x = "Run", y = "Median")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

ggplotly(medians_plot)

Contaminants

The most abundant contaminants are displayed. The file containing the contaminants can be found under MaxQuant/bin/conf/contaminants.fasta. Note that there might be proteins in the list that are part of the proteome you are analysing. Ideally you used a contaminants file that does not contain proteins of your species of interest (might not apply for human keratines). In addition you should add proteinase K (PK) to the contaminants file for LiP-MS experiments. This alows you to visually access if PK was added in roughly equal amounts to all samples.

complete_clean_contaminants %>%
  group_by(experiment)%>%
  mutate(total_intensity = sum(intensity, na.rm = TRUE))%>%
  filter(potential_contaminant == "+")%>%
  group_by(experiment, leading_razor_protein, total_intensity)%>%
  summarize(intensity = sum(intensity, na.rm = TRUE))%>%
  mutate("contaminant_intensity_per" = (intensity/total_intensity*100))%>%
  ungroup()%>%
  group_by(experiment) %>%
  mutate(contaminant_protein = ifelse(contaminant_intensity_per %in% head(sort(contaminant_intensity_per, decreasing = TRUE),5), as.character(leading_razor_protein), "other"))%>%
  arrange(contaminant_intensity_per)%>%
  ungroup()%>%
  mutate(contaminant_protein = fct_inorder(factor(contaminant_protein)))%>%
  group_by(experiment, contaminant_protein)%>%
  summarize(contaminant_intensity_per = sum(contaminant_intensity_per))%>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), y = contaminant_intensity_per, fill = contaminant_protein))+
  geom_col(col = "black")+
  scale_fill_discrete(name="Contaminant Protein")+
  labs(title = "Contaminants per Raw file", subtitle = "Excluding decoys", x = "Experiment", y = "% of total intensity")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

Most common contaminants:

  • P06873_106-384: Proteinase K
  • P00761: Trypsin
  • P15636: Lys-C
  • P02533: Keratin, type I cytoskeletal 14
  • P13646-3: Keratin, type I cytoskeletal 13
  • P04264: Keratin, type II cytoskeletal 1
  • P13645: Keratin, type I cytoskeletal 10
  • P05784: Keratin, type I cytoskeletal 18
  • P07737: Profilin-1
  • P67936: Tropomyosin alpha-4 chain

Missed cleavages

The percentage of missed cleavages is evaluated for peptides identified by MS/MS. Contaminants are excluded. A missed cleavage is counted even if the R/K is followed by a P. Thus, the estimate is more conservative and higher than for other search engines. In addition if there is an R/K directly prior to the last R/K it is counted as a missed cleavage.

missed_cleavages <- complete_clean %>%
  filter(type == "MULTI-MSMS"| type == "MSMS")%>%
  count(experiment, missed_cleavages)%>%
  group_by(experiment)%>%
  mutate(total_peptides = sum(n))%>%
  group_by(experiment, missed_cleavages)%>%
  summarise(missed_cleavages_per = n/total_peptides*100)%>%
  ungroup()%>%
  mutate(missed_cleavages = fct_inorder(factor(missed_cleavages)))

missed_cleavages%>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), y = missed_cleavages_per, fill = missed_cleavages))+
  geom_col(col = "black")+
  scale_fill_discrete(name="Missed cleavages")+
  geom_text(data = subset(missed_cleavages, missed_cleavages_per > 5), aes(label=round(missed_cleavages_per, digits=1)), position = position_stack(vjust = 0.5))+
  labs(title = "Missed cleavages per Raw file", subtitle = "Excluding decoys, excluding contaminants", x = "Experiment", y = "% of total peptides")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

Charge distribution

Under common experimental conditions, tryptic peptides are generally expected to carry two charges (one N-terminal and one at the C-terminal R or K residue). However, charge states can also reach 3 (or higher). A single peptide species can occur in multiple charge states, charge two usually being the most abundant. Several factors, such as the presence of additional basic amino acids (e.g. due to missed cleavages), the spray parameters or the pH of the eluents, can shift the distribution towards higher charge states. In LiP conditions with more semi-tryptic peptides charge states are expected to be shifted towards lower charges.

charge <- complete %>%
  count(experiment, charge)%>%
  group_by(experiment)%>%
  mutate(total_peptides = sum(n))%>%
  group_by(experiment, charge)%>%
  summarise(charge_per = n/total_peptides*100)%>%
  ungroup()%>%
  mutate(charge = fct_inorder(factor(charge)))

charge %>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), y = charge_per, fill = charge))+
  geom_col(col = "black")+
  scale_fill_discrete(name="Charge")+
  geom_text(data = subset(charge, charge_per > 5), aes(label=round(charge_per, digits=1)), position = position_stack(vjust = 0.5))+
  labs(title = "Charge distribution per Raw file", subtitle = "Including decoys, including contaminants and including oversampled peptides", x = "Experiment", y = "% of total peptides")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

Oversampling

An oversampled 3D-peak is defined as a peak whose peptide ion (same sequence and same charge state) was identified by at least two distinct MS2 spectra in the same Raw file. For high complexity samples, oversampling of individual 3D-peaks automatically leads to undersampling or even omission of other 3D-peaks, reducing the number of identified peptides. Oversampling occurs in low-complexity samples or long LC gradients, as well as undersized dynamic exclusion windows.

The MS/MS count variable from the evidence file is used for the calculation.

oversampling <- complete %>%
  filter(ms_ms_count != 0)%>%
  count(experiment, ms_ms_count)%>%
  group_by(experiment)%>%
  mutate(total_peptides = sum(n))%>%
  group_by(experiment, ms_ms_count)%>%
  summarize(ratio = n/total_peptides*100)%>%
  ungroup()%>%
  mutate(ms_ms_count = as.factor(ms_ms_count))%>%
  arrange(ratio)%>%
  mutate(identifications = fct_inorder(fct_other(ms_ms_count, keep = c(1, 2), other_level = "3+")))%>%
  group_by(experiment, identifications)%>%
  summarize(ratio = sum(ratio))

oversampling %>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), y = ratio, fill = identifications))+
  geom_col(col = "black")+
  scale_fill_discrete(name= str_wrap("MS/MS counts per 3D peak", 15))+
  geom_text(data = subset(oversampling, ratio > 5), aes(label=round(ratio, digits=1)), position = position_stack(vjust = 0.5))+
  labs(title = "Oversampling, identifications per 3D peak", subtitle = "Including decoys, including contaminants", x = "Experiment", y = "% of total peptides")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

Uncalibrated mass error

Mass accuracy or mass error is describing the difference between the measured m/z and the real, exact m/z of that precursor ion. It is usually in the range of a few ppm and should be similar between samples.

This only includes peptides identified in MSMS (“MULTI-MSMS”) due to the way the search engine (Andromeda) performs recalibration. It is not clear if contaminant peptides are included or not but they are left out here.

complete_clean_contaminants %>%
  filter(!is.na(uncalibrated_mass_error_ppm), type == "MULTI-MSMS")%>%
  select(experiment, uncalibrated_mass_error_ppm)%>%
  group_by(experiment)%>%
  mutate(sd = sd(uncalibrated_mass_error_ppm))%>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " "), "(sd =", as.character(round(sd, digits = 2)), ")"), uncalibrated_mass_error_ppm))+
  geom_boxplot()+
  labs(title = "Uncalibrated mass error", subtitle = "Excluding decoys, excluding contaminants", x = "Experiment", y = expression(paste(Delta, "mass [ppm]")))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

Calibrated mass error

The mass error is calibrated after an initial Andromeda search by applying a calculated correction factor.

complete_clean_contaminants %>%
  filter(!is.na(mass_error_ppm), type == "MULTI-MSMS")%>%
  select(experiment, mass_error_ppm)%>%
  group_by(experiment)%>%
  mutate(sd = sd(mass_error_ppm))%>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " "), "(sd =", as.character(round(sd, digits = 2)), ")"), mass_error_ppm))+
  geom_boxplot()+
  labs(title = "Calibrated mass error", subtitle = "Excluding decoys, excluding contaminants", x = "Experiment", y = expression(paste(Delta, "mass [ppm]")))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

Peak width over retention time

The peak width over retention time is a measure for optimal chromatographic seperation of peptide elution peaks. All samples should show a similar behavior. This is especially important for run alignments.

For every minute of the run, the median peak width is calculated and displayed.

peak_width <- complete %>%
  select(experiment, retention_time, retention_length)%>%
  ggplot(aes(retention_time, retention_length))+
  stat_summary_bin(aes(col = paste(str_replace_all(as.character(experiment), fixed("_"), " "))), size = 1, geom = "line", binwidth = 1, fun = median)+
  labs(title = "Median peak width over retention time", subtitle = "Including decoys, including contaminants", x = "Retention time [min]", y = "Median peak width [min]", color = "Experiment")+
  theme_bw()
ggplotly(peak_width)

Identifications over retention time

The number of identified peptides over retention time. The number of identifications should be uniform across the gradient to ensure optimal duty cycles. An uneven distribution indicates that there is potential room for gradient optimization.

Including MBR. Binwidth is 1 minute.

id_over_RT <- complete_clean_contaminants %>%
  select(experiment, retention_time)%>%
  mutate(count = 1)%>%
  ggplot(aes(retention_time, count))+
  stat_summary_bin(aes(col = paste(str_replace_all(as.character(experiment), fixed("_"), " "))), size = 1, geom = "line", binwidth = 1, fun = sum)+
  labs(title = "Peptide identifications over retention time", subtitle = "Excluding decoys, including contaminants", x = "Retention time [min]", y = "ID count per minute", color = "Experiment")+
  theme_bw()

ggplotly(id_over_RT)

TopN over retention time

TopN is the number of MS/MS scans recorded per MS spectrum. TopN depends on the intensity threshold and amount of peaks. Furthermore, it depends on the total intensity of ions since the orbitrap needs longer for filling when peak intensities are lower. This leads to a lower topN per MS cycle. Sometimes it is specified in the method that there should be a max number of MS/MS scans triggered per MS spectrum. This is reflected by a line with a plateau. This is the case if MS/MS spectra are recorded by an orbitrap. A longer gradient or MS/MS acquisition by an ion trap (faster, but less accurate) might improve the situation and allow for deeper analysis of the sample.

Similar to ID over RT, this metric reflects the complexity of the sample at any point in time. The median TopN per 2 minute retention time bin is displayed.

topn_rt <- msmsscans %>%
  select(experiment, retention_time, ms_scan_number)%>%
  group_by(experiment)%>%
  mutate(binRT = round(retention_time/2)*2)%>%
  count(experiment, binRT, ms_scan_number, name="topN")%>%
  group_by(experiment, binRT)%>%
  mutate(median_topN = median(topN))

topn_rt_plot <- topn_rt%>%
  ggplot(aes(binRT, median_topN))+
  geom_line(aes(col = paste(str_replace_all(as.character(experiment), fixed("_"), " "))), size = 1, )+
  labs(title = "TopN over retention time", x = "Retention time [min]", y = "TopN [median per 2 min RT bin]", color = "Experiment")+
  theme_bw()

ggplotly(topn_rt_plot)

TopN frequencies

TopN is the number of MS/MS scans recorded per MS spectrum. The frequency for each TopN per experiment is displayed. This is a summary of TopN over RT.

topn_rt%>%
  ggplot(aes(topN))+
  geom_histogram(binwidth = 1, color = "black", fill = "steelblue1")+
  facet_wrap(~paste(str_replace_all(as.character(experiment), fixed("_"), " ")), ncol=n_cols_topN)+
  labs(title = "TopN frequency", x = "TopN", y = "Frequency")+
  theme_bw()

Alignment between runs

The alignment of runs is verified by checking the retention time differences of identical precursors across Raw files. The default window MaxQuant uses to align runs is 20 minutes. The residual retention time difference after alignment is shown.

For each precursor (with the same modification, sequence and charge) in the reference raw file, the already calibrated retention time difference of the same feature in all other files is calculated. If the alignment worked perfectly, the differences are very small (<0.7 min). Only real MS/MS identifications (“MULTI-MSMS”) are used. The calibrated retention time is used to compute the retention time difference.

Identical precursors (same modification, sequence and charge) across runs are filtered based on the following criteria:

Each group has to contain more than one observation. Each group that contains duplicate identical precursors with different elution times is removed completely. Each group has to contain the reference file, which correstponds to the first file in the experiment (ms_running_order).

If the alignment worked not very well and there are a lot of values outside of the 0.7 minute border and increase of the default alignment window in MaxQuant could yield a better alignment.

The order in which LC-MS runs are aligned is determined by hierarchical clustering, thereby avoiding reliance on a single master run. The terminal branches of the tree from the hierarchical clustering typically connect LC-MS runs of the same or neighboring fractions or replicate runs, since they are most similar. These cases are aligned first. Moving along the tree structure, increasingly dissimilar runs are integrated. The calibration functions that are needed to completely align LC-MS runs are usually timedependent in a non-linear way. Every pair-wise alignment step is performed by two dimensional Gaussian kernel smoothing of the mass matches between the two runs. Following the ridge of the highest density region determines the recalibration function. At each tree node the resulting recalibration function is applied to one of the two subtrees, while the other is left unaltered. This description was taken from the MaxLFQ paper. This is simplified here by just comparing all runs to the first run.

print("No Match between runs was carried out by MaxQuant")
alignment <- complete %>%
  filter(type == "MULTI-MSMS")%>%
  select(modified_sequence, charge, experiment, ms_running_order, calibrated_retention_time)%>%
  group_by(modified_sequence, charge)%>%
  filter(n()>1)%>%
  group_by(modified_sequence, charge, ms_running_order)%>%
  mutate(n = n())%>%
  group_by(modified_sequence, charge)%>%
  filter(!any(n!=1))%>%
  filter(any(ms_running_order == 1))%>%
  group_by(modified_sequence, charge)%>%
  select(-n)%>%
  arrange(ms_running_order, .by_group = TRUE)%>%
  mutate(delta_RT = calibrated_retention_time - first(calibrated_retention_time))%>%
  group_by(experiment)%>%
  mutate(fraction_good = (sum(delta_RT<0.7&delta_RT>-0.7)/n()))%>%
  mutate(quality = ifelse(delta_RT<0.7 & delta_RT>-0.7, "Good (<0.7 min)", "Bad (>0.7 min)"))

alignment %>% 
  ggplot(aes(calibrated_retention_time, delta_RT, col = quality))+
  geom_point()+
  facet_wrap(~paste(str_to_title(str_replace_all(as.character(experiment), fixed("_"), " ")), paste("(Alignment: ",(round(alignment$fraction_good,3)*100), "%)" )), ncol=2)+
  labs(title = "Run alignment", subtitle = "Including decoys, including contaminants", x = "Calibrated Retention time [min]", y = expression(paste(Delta, "RT [min]")), color = "Quality")+
  coord_cartesian(ylim = c(2, -2))+
  theme_bw()
alignment <- complete %>%
  filter(type == "MULTI-MSMS")%>%
  select(modified_sequence, charge, experiment, ms_running_order, calibrated_retention_time, fraction)%>%
  group_by(fraction)%>%
  mutate(min_run = min(ms_running_order))%>%
  group_by(modified_sequence, charge, fraction)%>%
  filter(n()>1)%>%
  group_by(modified_sequence, charge, ms_running_order, fraction)%>%
  mutate(n = n())%>%
  group_by(modified_sequence, charge, fraction)%>%
  filter(!any(n!=1))%>%
  filter(any(ms_running_order == min_run))%>%
  group_by(modified_sequence, charge, fraction)%>%
  select(-n)%>%
  arrange(ms_running_order, .by_group = TRUE)%>%
  mutate(delta_RT = calibrated_retention_time - first(calibrated_retention_time))%>%
  group_by(experiment, fraction)%>%
  mutate(fraction_good = (sum(delta_RT<0.7&delta_RT>-0.7)/n()))%>%
  mutate(quality = ifelse(delta_RT<0.7 & delta_RT>-0.7, "Good (<0.7 min)", "Bad (>0.7 min)"))

alignment %>% 
  ggplot(aes(calibrated_retention_time, delta_RT, col = quality))+
  geom_point()+
  facet_wrap(~paste(str_replace_all(as.character(experiment), fixed("_"), " "), paste("(Alignment: ",(round(alignment$fraction_good,3)*100), "%)" )), ncol=2)+
  labs(title = "Run alignment of fractions", subtitle = "Including decoys, including contaminants", x = "Calibrated Retention time [min]", y = expression(paste(Delta, "RT [min]")), color = "Quality")+
  coord_cartesian(ylim = c(2, -2))+
  theme_bw()

ID transfer of match between runs

After retention time alignment MaxQuant tries to transfer IDs to files that are lacking identifications.

An unidentified 3D peak (target) receives an annotation if a genuinely identified counterpart from another Raw file (source) has a similar calibrated retention time (0.7 to 2 min deviation by default, depending on the MaxQuant version) and if its m/z matches the theoretical m/z of the peptide to be transferred (within 4.5−7 ppm by default, depending on the MaxQuant version).

MaxQuant reports the RT difference between the source identification and its target in the “match-time-difference” column of the evidence table. However, small values do not indicate that this matching is correct, since any unannotated 3D peak with similar RT and m/z is a putative target candidate. Therefore, a robust alignment is paramount for the ID-transfer.

With this plot we compare all transferred identifications to the genuine identifications within each Raw file. If the transfer was correct, then no identification should occur more than once. In particular, a genuine ID (locally confirmed by MS2) and a transferred ID in the same Raw file indicate that the matching targeted the wrong 3D peak (generating a false positive) because the same peptide was already identified genuinely. Alternatively, the MaxQuant feature finding algorithm accidentally split a 3D peak into two separate entities, where only one was identified by MS2. In this case, the genuine and transferred IDs will have similar corrected retention times.

Every 3D peak is assigned into one of three classes: “Single”, “Group − in width”, and “Group − out width”. The “Single” class covers all 3D peaks whose peptide sequence and charge state is unique for the Raw file at hand. The other two classes represent 3D peaks that are part of a group, i.e., that have siblings with identical sequence and charge in the same Raw file.

Within each peak group, the retention time deviations are used to decide if the group is valid (“in width”) or invalid (“out width”). The threshold to decide if a peak group is “in width” is the median RT peak width of the respective Raw file. If the RT span of the group is larger than the typical RT peak, then the evidence is considered segmented and it is assigned to the out-width class, i.e., it is unlikely that the outwidth group represents a split 3D peak but, rather, two (or more) entirely different 3D peaks.

Depending on which subset of peaks is used to assign the three classes, different conclusions can be drawn. Considering only genuine 3D peaks (MULTI-MSMS) and assigning them to a class, we can determine the intrinsic segmentation of a Raw file. The proportion of out-width peaks is usually very small, since a peptide usually elutes only once from the LC column. If we consider only the subset of 3D peaks that were identified via MBR (MULTI-MATCH) plus all genuine 3D peaks that have the same identification (MULTI-MSMS), then we can draw conclusions about the success of the ID-transfer (Transferred peptides).

The out-width fraction can rise considerably, depending on the success of the alignment, resulting in a lower score. If the median peak width over RT had abnormalities, peaks lying in areas with broad peak width that have multiple identientifications (which might occure naturally since the peak is wide) can also yield a high fraction of the out-width fraction. This is likely the case if the genuine group has many out-width peaks.

Finally, if we consider all 3D peaks (irrespective of if they are genuine or transferred) and assign a class to them, then we obtain an overall view on the segmentation issue.

print("No Match between runs was carried out by MaxQuant")
ID_transfer_single <- complete %>%
  select(experiment, modified_sequence, charge, type, match_time_difference)%>%
  filter(type == "MULTI-MSMS"| type == "MULTI-MATCH")%>%
  mutate(hasMTD = !is.na(match_time_difference))%>%
  group_by(experiment, modified_sequence, charge)%>%
  mutate(n_native = sum(!hasMTD), n_matched = sum(hasMTD))%>%
  select(-type, -match_time_difference, -hasMTD)%>%
  distinct()%>%
  group_by(experiment)%>%
  mutate(single_native = sum(n_native[n_native == 1], n_matched[n_native == 1])/sum(n_native[n_native != 0], n_matched[n_native != 0]))%>%
  mutate(single_matched = sum(n_matched[n_matched == 1 & n_native == 0])/sum(n_matched[n_matched !=0], n_native[n_matched !=0]))%>%
  mutate(single_all = sum(n_native[n_native == 1 & n_matched == 0], n_matched[n_matched == 1 & n_native == 0])/sum(n_native, n_matched))%>%
  distinct(experiment, single_native, single_matched, single_all)
rt_diff <- complete %>%
  select(experiment, modified_sequence, charge, type, calibrated_retention_time)%>%
  #filter(type == "MULTI-MSMS"| type == "MULTI-MATCH" | type == "MULTI-SECPEP")%>%
  group_by(experiment, modified_sequence, charge)%>%
  mutate(singlets = n()==1)%>%
  filter(singlets == FALSE)%>%
  mutate(type_2 = type != "MULTI-MATCH", type_3 = type != "MSMS", type_4 = type != "MULTI-SECPEP" )%>%
  select(-singlets)%>%
  group_by(experiment, modified_sequence, charge, type)%>%
  mutate(rt_diff_genuine = if_else(max(type_2) ==1 & max(type_3) != 0 & max(type_4) != 0, max(calibrated_retention_time)-min(calibrated_retention_time), 0))%>%
  group_by(experiment, modified_sequence, charge)%>%
  mutate(rt_diff_mixed = if_else(type_2 ==0 & rt_diff_genuine == 0 & max(type_3) != 0 | type_2 == 0 & rt_diff_genuine == 0, max(calibrated_retention_time)-min(calibrated_retention_time),-1 ))%>%
  filter(type != "MSMS")%>%
  na_if(-1)%>%
  mutate_at(vars(rt_diff_genuine), na_if, 0)%>%
  distinct(experiment, modified_sequence, charge, rt_diff_genuine, rt_diff_mixed)%>%
  filter(!is.na(rt_diff_genuine) | !is.na(rt_diff_mixed))%>%
  group_by(experiment, modified_sequence, charge)%>%
  summarize(rt_diff_genuine = max(rt_diff_genuine, na.rm = TRUE), rt_diff_mixed = max(rt_diff_mixed, na.rm = TRUE))%>%
  na_if(-Inf)%>%
  ungroup()
median_peak_width <- complete %>%
  select(experiment, retention_length)%>%
  group_by(experiment)%>%
  mutate(median_pw = median(retention_length))%>%
  distinct(experiment, median_pw)

in_rt_diff <- rt_diff %>%
  select(-modified_sequence, -charge)%>%
  left_join(median_peak_width, by = "experiment")%>%
  group_by(experiment)%>%
  mutate(in_rt_genuine = sum(rt_diff_genuine < median_pw, na.rm = TRUE) / sum(!is.na(rt_diff_genuine)))%>%
  mutate(in_rt_mixed = sum(rt_diff_mixed < median_pw, na.rm = TRUE) / sum(!is.na(rt_diff_mixed)))%>%
  mutate(rt_diff_all = pmax(rt_diff_mixed, rt_diff_genuine, na.rm = TRUE))%>%
  #takes max of genuine and mixed if there are both for a given peptide
  mutate(in_rt_all = sum(rt_diff_all < median_pw, na.rm = TRUE) / sum(!is.na(rt_diff_all)))%>%
  distinct(experiment, in_rt_genuine, in_rt_mixed, in_rt_all)
ID_transfer <- ID_transfer_single %>%
  left_join(in_rt_diff, by = "experiment")%>%
  mutate(non_single_native = 1 - single_native, non_single_matched = 1 - single_matched, non_single_all = 1 - single_all)%>%
  mutate(genuine_in = non_single_native * in_rt_genuine, genuine_out = non_single_native * (1- in_rt_genuine))%>%
  mutate(transferred_in = non_single_matched * in_rt_mixed, transferred_out = non_single_matched * (1- in_rt_mixed))%>%
  mutate(all_in = non_single_all * in_rt_all, all_out = non_single_all * (1 - in_rt_all))%>%
  select(-c(5:10))%>%
  gather("class", "value", -experiment)%>%
  mutate(peak_class = case_when(
    str_detect(class, "single") ~ "Single",
    str_detect(class, "_in") ~ "Group in width (< median peak width)",
    str_detect(class, "_out") ~ "Group out width (> median peak width)"))%>%
  mutate(transfer_type = case_when(
    str_detect(class, "genuine|native") ~ "Detected",
    str_detect(class, "transferred|matched") ~ "Transferred",
    str_detect(class, "all") ~ "All"))%>%
  select(-class)%>%
  mutate(value = value *100)%>%
  mutate(peak_class = fct_relevel(as.factor(peak_class), c("Group out width (> median peak width)", "Group in width (< median peak width)", "Single")))%>%
  mutate(transfer_type = fct_relevel(as.factor(transfer_type), c("Detected", "Transferred", "All")))

ID_transfer %>%
  ggplot(aes(paste(str_to_title(str_replace_all(as.character(experiment), fixed("_"), " "))), value, fill = peak_class))+
  geom_col()+
  facet_wrap(~ transfer_type, ncol =1)+
  labs(title = "ID Transfer", x = "Experiment", y = "Fraction of 3D-peaks [%]", fill = "Peak type")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

Peptides inferred by match between runs

The number peptides that were identified by match between runs and their contribution to the MS/MS identified peptides is calculated.

print("No Match between runs was carried out by MaxQuant")
MBR_inferred <- complete %>%
  select(experiment, type)%>%
  group_by(experiment)%>%
  mutate(n = n())%>%
  mutate(match = ifelse(type == "MULTI-MATCH", 1, 0))%>%
  mutate(mbr = sum(match))%>%
  mutate(gain = (mbr/(n-mbr)*100))%>%
  select(-type, -match)%>%
  distinct()%>%
  ggplot(aes(mbr, gain, col = paste(str_replace_all(as.character(experiment), fixed("_"), " ")) ))+
  geom_point()+
  labs(title = "Peptides inferred by MBR", subtitle = "Including decoys, including contaminants", x = "Number of transfered ID's", y = "Gain on top of genuine ID's [%]", color = "Experiment")+
  expand_limits(x = 0, y = 0)+
  theme_bw()

ggplotly(MBR_inferred)

Qualitative analysis

Peptides

The total amount of identified peptidoforms is 54612. Peptidoforms distinguish single peptides that are modified differently.

complete_clean %>%
  summarize(total_peptidoforms = n_distinct(modified_sequence))

Possible modifications include: Unmodified, Acetyl (Protein N-term), Oxidation (M), Acetyl (Protein N-term),Oxidation (M), 2 Oxidation (M), Acetyl (Protein N-term),2 Oxidation (M)

The total amount of uniquely identified peptides independent modification-status is 54355. That means even if a peptide is identified multiple times with and without modifications it is only counted once.

complete_clean %>%
  summarize(total_peptides = n_distinct(sequence))

The amount of peptides that were identified in only one, up to all samples was calculated. The plot shows the sum of peptides for the indecated amount of samples they were identified in.

peptides_samples <- complete_clean %>%
  distinct(sequence, experiment)%>%
  count(sequence, name = "number_of_samples")%>%
  count(number_of_samples, name = "number_of_peptides")

peptides_samples_lable <- peptides_samples$number_of_peptides

peptides_samples%>%
  ggplot(aes(factor(number_of_samples),number_of_peptides))+
  geom_col(fill = "steelblue1", col = "black")+
  geom_text(aes(label=peptides_samples_lable, y = number_of_peptides + max(number_of_peptides)*0.03), size=4)+
  labs(title = "Number of peptides identified in X amount of samples", x = "Amount of Samples", y = "Number of Peptides")+
  theme_bw()+
  coord_cartesian(ylim = c(0, (max(peptides_samples$number_of_peptides))+1000))

complete_clean %>%
  distinct(sequence, experiment)%>%
  group_by(sequence)%>%
  mutate(n_samples_peptide = n())%>%
  group_by(n_samples_peptide, experiment)%>%
  count(name = "frequency")%>%
  ungroup()%>%
  mutate(n_samples_peptide = as.factor(n_samples_peptide))%>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), frequency, fill = n_samples_peptide))+
  geom_bar(stat = 'identity', col = "black") +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))+
  scale_fill_manual(values = scales::seq_gradient_pal("dodgerblue4", "steelblue1", "Lab")(seq(0,1,length.out=n_distinct(complete_clean$experiment))))+
  labs(title = "Number of peptide by frequency per sample", x = "Sample", y = "Number of Peptides", fill = "Frequency")

The number of total peptides identified in each sample is analysed. In addition the cumulative identified total unique peptides is shown.

cum_peptides <- complete_clean %>%
  distinct(sequence, experiment)%>%
  spread(experiment, - sequence)%>%
  gather("experiment", "presence", -sequence)%>%
  mutate(presence = case_when(is.na(presence) ~ FALSE, TRUE ~ TRUE))%>%
  group_by(sequence)%>%
  mutate(csum = cumsum(presence))%>%
  group_by(experiment)%>%
  filter(csum != 0)%>%
  summarize(total_cum_peptides = n())%>%
  mutate(experiment = factor(experiment))

#adding total_peptides per run

cum_peptides <- complete_clean %>%
  distinct(sequence, experiment)%>%
  count(experiment, name = "total_peptides")%>%
  left_join(cum_peptides, by = "experiment")

#gather + make "type" factor and reverse order of levels

cum_peptides <- cum_peptides %>%
  gather("type", "peptides", -experiment)%>%
  mutate(type = fct_rev(type))%>%
  arrange(desc(type))

#plotting

cum_peptides%>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), peptides, fill = type))+
  geom_col(position = "identity", col = "black")+
  scale_fill_manual(values = c("total_peptides" = "steelblue1", "total_cum_peptides" = "brown1"), name = "Peptides", labels = c("per Sample", "cumulative over Samples"))+
  geom_text(aes(label=peptides, y = peptides + max(peptides) *0.03), size =3)+
  labs(title = "Total unique Peptides per Sample and cumulative over Samples", x = "Experiment", y = "Number of Peptides")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

Semi-tryptic peptides

Each sample contains not only fully-tryptic peptides, but also semi- and non-tryptic peptides. Fully tryptic peptides are peptides that are predicted to occure through standard trypsin digestion. In semi-tryptic peptides only one side (start or end) conforms to this pattern. Non-tryptic peptides do not end with a K/R nor was the preciding amino acid a K/R.

For LiP samples it is expected that about 30-40% of peptides are semi-tryptic due to proteinase K digestion. Also the amount of non-tryptic peptides can be slighly increased.

semi_tryptic <- complete_clean %>%
  count(experiment, pep_type, name = "count")%>%
  group_by(experiment)%>%
  mutate(relative = count/sum(count)*100)

semi_tryptic %>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), relative, fill = pep_type))+
  geom_bar(stat = "identity", position="dodge", col ="black")+
  labs(title = "Peptide types", subtitle = "Excluding decoys, excluding contaminants", x = "Experiment", y = "Percentage of all peptides", fill="Type")+
  geom_text(aes(label= round(relative,1), y = relative + max(relative)*0.03), position = position_dodge(1), size=4)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))+
  coord_cartesian(ylim = c(0, (max(semi_tryptic$relative))+5))

Proteins

The total amount of proteins in the reference proteome is 20416.
The total amount of identified proteins is 4480.

proteins <- complete_clean %>%
  summarize(proteins_detected = n_distinct(leading_razor_protein))%>%
  mutate(experiment = "Total")

proteome_count <- proteome %>%
  summarize(proteins_proteome = n_distinct(uniprot_id))

complete_clean %>%
  group_by(experiment)%>%
  summarize(proteins_detected = n_distinct(leading_razor_protein))%>%
  bind_rows(proteins)%>%
  mutate(proteins_difference = proteome_count$proteins_proteome-proteins_detected)%>%
  mutate(proteins_difference = proteins_difference/proteome_count$proteins_proteome*100, proteins_detected = proteins_detected/proteome_count$proteins_proteome*100)%>%
  gather(type, number, -experiment)%>%
  mutate(experiment = relevel(factor(str_replace_all(as.character(experiment), fixed("_"), " ")),"Total"), type = fct_rev(factor(type)))%>%
  ggplot(aes(experiment, number, fill=type))+
  geom_bar(col = "black", stat="identity")+
  labs(title = "Proteome coverage", x = "", y = "Proteome [%]")+
  scale_fill_manual(values = c("proteins_detected" = "steelblue1", "proteins_difference" = "brown1"), name = "Proteins", labels = c("Not detected", "Detected"))+
  geom_text(aes(label= round(number,1)), position = position_stack(vjust = 0.5), size=4)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

Quantitative analysis

Peptide intensity distribution before normalization

intensity_plot <- complete_clean %>%
  filter(!is.na(intensity))%>%
  select(sequence, intensity, experiment)%>%
  group_by(experiment, sequence)%>%
  summarize(sum_intensity = sum(intensity))%>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), sum_intensity))+
  geom_boxplot(fill = "steelblue1")+
  scale_y_continuous(trans = scales::log2_trans(),
                     breaks = scales::trans_breaks("log2", function(x) 2^x),
                     labels = scales::trans_format("log2", scales::math_format(2^.x)))+
  labs(title = "Peptide intensity distribution over runs", subtitle = "Excluding decoys, excluding contaminants", x = "Experiment", y = "log2 Intensity")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

ggplotly(intensity_plot)

CV of intensity for each peptide

The coefficient of variation is a measure of dispersion for a distribution. In this case the intensity distribution of peptides. In contrast to the RSD not the medians of intensities for each sample are compared but the intensities for each peptide within conditions. That means the difference of replicates is displayed. The combined cv for peptides shows how similar all samples are to eachother on the peptide level.

The CV is expected to be smaller for conditions than for all samples combined. If the combined CV is similar to the CV of each replicate, experimental groups (control, treatment) are likely also very similar to eachother.

LiP

paste("No LiP sample was included in this experiment")
cv_peptide_lip <- complete_clean %>%
  filter(!is.na(intensity), pipeline == "LiP")%>%
  mutate(replicate_group = as.factor(replicate_group))%>%
  group_by(modified_sequence, charge)%>%
  mutate(cv_peptide = sd(intensity)/mean(intensity))%>%
  group_by(modified_sequence, charge, experiment)%>%
  mutate(intensity = sum(intensity))%>%
  ungroup()%>%
  group_by(modified_sequence, charge, replicate_group)%>%
  distinct(modified_sequence, intensity, cv_peptide, replicate_group)%>%
  mutate(cv_peptide_replicate = sd(intensity)/mean(intensity))%>%
  ungroup()%>%
  select(modified_sequence, cv_peptide, cv_peptide_replicate, replicate_group)%>%
  gather(type, cv, -modified_sequence, -replicate_group)%>%
  filter(!is.na(cv))%>%
  mutate(type = as.factor(if_else(type == 'cv_peptide_replicate', as.character(replicate_group), "combined")))%>%
  distinct(modified_sequence, cv, type)%>%
  group_by(type)%>%
  mutate(median = median(cv))%>%
  ungroup()

cv_peptide_lip %>%
  ggplot(aes(cv, col = type))+
  geom_density(size = 1)+
  scale_x_continuous(breaks = round(seq(min(cv_peptide_lip$cv), max(cv_peptide_lip$cv)+0.5, by = 0.2),1 ))+
  theme_bw()+
  labs(title = "CV density of modfied peptide intensities (LiP)", subtitle = "Without decoys and contaminants", x = "CV (coefficient of variation)", y = "Frequency")

cv_peptide_lip %>%
  ggplot(aes(cv))+
  geom_histogram(binwidth = 0.05, color = "black", fill = "steelblue1")+
  scale_x_continuous(breaks = round(seq(min(cv_peptide_lip$cv), max(cv_peptide_lip$cv)+0.5, by = 0.2),1 ))+
  theme_bw()+
  labs(title = "CV frequency of modified peptide intensities (LiP)", subtitle = "Without decoys and contaminants", x = "CV (coefficient of variation)", y = "Frequency")+
  geom_vline(aes(xintercept = median),col='black',size=1, linetype = "dashed")+
  facet_wrap(~paste(type, "(median:", round(median, 3), ")"), ncol=1)

Trypsin Digest

paste("No solely tryptic sample was included in this experiment.")
cv_peptide_trypsin <- complete_clean %>%
  filter(!is.na(intensity), pipeline == "Trypsin Digest")%>%
  mutate(replicate_group = as.factor(replicate_group))%>%
  group_by(modified_sequence, charge)%>%
  mutate(cv_peptide = sd(intensity)/mean(intensity))%>%
  group_by(modified_sequence, charge, experiment)%>%
  mutate(intensity = sum(intensity))%>%
  ungroup()%>%
  group_by(modified_sequence, charge, replicate_group)%>%
  distinct(modified_sequence, intensity, cv_peptide, replicate_group)%>%
  mutate(cv_peptide_replicate = sd(intensity)/mean(intensity))%>%
  ungroup()%>%
  select(modified_sequence, cv_peptide, cv_peptide_replicate, replicate_group)%>%
  gather(type, cv, -modified_sequence, -replicate_group)%>%
  filter(!is.na(cv))%>%
  mutate(type = as.factor(if_else(type == 'cv_peptide_replicate', as.character(replicate_group), "combined")))%>%
  distinct(modified_sequence, cv, type)%>%
  group_by(type)%>%
  mutate(median = median(cv))%>%
  ungroup()

cv_peptide_trypsin %>%
  ggplot(aes(cv, col = type))+
  geom_density(size = 1)+
  scale_x_continuous(breaks = round(seq(min(cv_peptide_trypsin$cv), max(cv_peptide_trypsin$cv)+0.5, by = 0.2),1 ))+
  theme_bw()+
  labs(title = "CV density of modfied peptide intensities (Tryptic digest)", subtitle = "Without decoys and contaminants", x = "CV (coefficient of variation)", y = "Frequency")

cv_peptide_trypsin %>%
  ggplot(aes(cv))+
  geom_histogram(binwidth = 0.05, color = "black", fill = "steelblue1")+
  scale_x_continuous(breaks = round(seq(min(cv_peptide_trypsin$cv), max(cv_peptide_trypsin$cv)+0.5, by = 0.2),1 ))+
  theme_bw()+
  labs(title = "CV frequency of modified peptide intensities (Tryptic digest)", subtitle = "Without decoys and contaminants", x = "CV (coefficient of variation)", y = "Frequency")+
  geom_vline(aes(xintercept = median),col='black',size=1, linetype = "dashed")+
  facet_wrap(~paste(type, "(median:", round(median, 3), ")"), ncol=1)

Principal component analysis of peptide intensities

A principal componant analysis (PCA) is computed on all intensities. It allows a visual exploration of similarities between samples and conditions. A clustering of similar conditions is expected.

pca_input <- complete_clean %>%
  filter(!is.na(intensity))%>%
  select(sequence, intensity, experiment)%>%
  group_by(experiment, sequence)%>%
  summarize(sum_intensity = sum(intensity))%>%
  spread(experiment, sum_intensity)%>%
  drop_na()%>%
  select(-sequence)%>%
  t(.)

pca <- prcomp(pca_input, center = TRUE)

pca_df <- as.data.frame(pca$x)%>%
  mutate(experiment = factor(row.names(.)))%>%
  left_join(map, by = "experiment")

pca_sdev_df <- as.data.frame(pca$sdev)
pca_sdev_df <- pca_sdev_df %>%
  mutate(percent_variance = (pca$sdev^2/sum(pca$sdev^2)*100), dimension = row.names(.))

pca_df%>%
  ggplot(aes(PC1, PC2, col=replicate_group, shape = pipeline))+
  geom_point(size = 3)+
  labs(title = "Principal component analysis of samples", subtitle = "Excluding decoys, excluding contaminants", x = paste("PC1", "(", round(pca_sdev_df$percent_variance[pca_sdev_df$dimension == 1],1), "%)"), y = paste("PC2", "(", round(pca_sdev_df$percent_variance[pca_sdev_df$dimension == 2],1), "%)"), color = "Experiment", shape = "Pipeline")+
  geom_text_repel(aes(label = paste(str_replace_all(as.character(experiment), fixed("_"), " ")),  color = replicate_group), size = 4, show.legend = FALSE)+
  theme_bw()

Correlation of samples

The correlation between samples allows the identification of sample clustering similar to PCA.

Peptide intensities of samples were correlated according to Spearman’s rank correlation coefficient. Samples are expected to cluster by conditions and replicates are expected to be the most similar. There should not be any batch effects that can be identified though clustering by running order.

correlation <- complete_clean %>% 
  filter(!is.na(intensity))%>%
  group_by(experiment, sequence)%>%
  mutate(sum_intensity = sum(intensity))%>%
  group_by(experiment)%>%
  distinct(sum_intensity, sequence, experiment)%>%
  spread(experiment, sum_intensity)%>%
  column_to_rownames('sequence')%>%
  drop_na()

cor(correlation, method = "spearman")%>%
  pheatmap()

Median normalization

After median normalisation data was log2 transformed in order to make mean and variance independent. This is important for further statistical testing due to the fact that then a change of x units is as significant for highly abundant peptides, as it is for low abundant ones.

It does not matter if mean normalization is performed before or after log2 transformation. However, the RSD after normalisation can be calculated with the standard formular if the log2 transformation is only performed afterwards. The RSD of sample medians after median normalisation is expected to be 0.

normalised_lip <- complete_clean %>%
  filter(!is.na(intensity), pipeline == "LiP")%>%
  group_by(experiment, sequence)%>%
  mutate(sum_intensity = sum(intensity))%>%
  group_by(experiment)%>%
  distinct(sum_intensity, sequence)%>%
  mutate(median_run_intensity = median(sum_intensity))%>%
  ungroup()%>%
  mutate(median_intensity = median(unique(median_run_intensity), na.rm = TRUE))%>%
  group_by(experiment)%>%
  mutate(normalised_intensity = (sum_intensity/median_run_intensity)*median_intensity)%>%
  mutate(normalised_intensity_log2 = log2(normalised_intensity))%>%
  right_join(complete_clean%>%filter(pipeline == "LiP"), by = c("experiment", "sequence"))%>%
  select(-median_run_intensity, -median_intensity)%>%
  group_by(experiment)%>%
  mutate(median_run_intensity = median(intensity, na.rm = TRUE))%>%
  ungroup()%>%
  mutate(median_intensity = median(unique(median_run_intensity), na.rm = TRUE))%>%
  group_by(experiment)%>%
  mutate(normalised_precursor_intensity = (intensity/median_run_intensity)*median_intensity)%>%
  select(-median_run_intensity, -median_intensity)%>%
  mutate(normalised_precursor_intensity_log2 = log2(normalised_precursor_intensity))


normalised_trypsin <- complete_clean %>%
  filter(!is.na(intensity), pipeline == "Trypsin Digest")%>%
  group_by(experiment, sequence)%>%
  mutate(sum_intensity = sum(intensity))%>%
  group_by(experiment)%>%
  distinct(sum_intensity, sequence)%>%
  mutate(median_run_intensity = median(sum_intensity))%>%
  ungroup()%>%
  mutate(median_intensity = median(unique(median_run_intensity), na.rm = TRUE))%>%
  group_by(experiment)%>%
  mutate(normalised_intensity = (sum_intensity/median_run_intensity)*median_intensity)%>%
  mutate(normalised_intensity_log2 = log2(normalised_intensity))%>%
  right_join(complete_clean%>%filter(pipeline == "Trypsin Digest"), by = c("experiment", "sequence"))%>%
  select(-median_run_intensity, -median_intensity)%>%
  group_by(experiment)%>%
  mutate(median_run_intensity = median(intensity, na.rm = TRUE))%>%
  ungroup()%>%
  mutate(median_intensity = median(unique(median_run_intensity), na.rm = TRUE))%>%
  group_by(experiment)%>%
  mutate(normalised_precursor_intensity = (intensity/median_run_intensity)*median_intensity)%>%
  select(-median_run_intensity, -median_intensity)%>%
  mutate(normalised_precursor_intensity_log2 = log2(normalised_precursor_intensity))

normalised <- bind_rows(normalised_lip, normalised_trypsin)

normalised_intensities_plot <- normalised %>%
  group_by(experiment)%>%
  distinct(sequence, normalised_intensity)%>%
  filter(!is.na(normalised_intensity))%>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), normalised_intensity))+
  geom_boxplot(fill = "steelblue1")+
  labs(title = "Median normalised peptide intensity distribution over runs", subtitle = "Excluding decoys, excluding contaminants", x = "Experiment", y = "log2 Intensity")+
  scale_y_continuous(trans = scales::log2_trans(),
                     breaks = scales::trans_breaks("log2", function(x) 2^x),
                     labels = scales::trans_format("log2", scales::math_format(2^.x)))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1))

ggplotly(normalised_intensities_plot)
RSD_lip_normalised <- normalised %>%
  distinct(sequence, normalised_intensity, pipeline)%>%
  filter(!is.na(normalised_intensity), pipeline == "LiP")%>%
  group_by(experiment)%>%
  summarize(median = median(normalised_intensity))%>%
  mutate(RSD_median = sd(median)/mean(median)*100, mean_median = mean(median), sd_median = sd(median))

if(nrow(RSD_lip_normalised) == 0){RSD_lip_normalised[1, ] <- NA}

RSD_trypsin_normalised <- normalised %>%
  distinct(sequence, normalised_intensity, pipeline)%>%
  filter(!is.na(normalised_intensity), pipeline == "Trypsin Digest")%>%
  group_by(experiment)%>%
  summarize(median = median(normalised_intensity))%>%
  mutate(RSD_median = sd(median)/mean(median)*100, mean_median = mean(median), sd_median = sd(median))

if(nrow(RSD_trypsin_normalised) == 0){RSD_trypsin_normalised[1, ] <- NA}

LiP

RSD of median intensities: 11.27%
RSD of normalised median intensities: 0%

Trypsin

RSD of median intensities: 8.58%
RSD of normalised median intensities: 0%

Principal component analysis of normalised peptide intensities

The PCA after normalisation should not yield a completely different result than before normalisation if there were not large differences in intensities between samples. The clustering of conditions should improve slightly due to less variablity of intensities.

pca_input_normalised <- normalised %>%
  filter(!is.na(normalised_intensity))%>%
  distinct(sequence, normalised_intensity, experiment)%>%
  group_by(experiment, sequence)%>%
  summarize(sum_intensity = sum(normalised_intensity))%>%
  spread(experiment, sum_intensity)%>%
  drop_na()%>%
  select(-sequence)%>%
  t(.)

pca_normalised <- prcomp(pca_input_normalised, center = TRUE)

pca_df_normalised <- as.data.frame(pca_normalised$x)%>%
  mutate(experiment = factor(row.names(.)))%>%
  left_join(map, by = "experiment")

pca_sdev_df_normalised <- as.data.frame(pca_normalised$sdev)
pca_sdev_df_normalised <- pca_sdev_df_normalised %>%
  mutate(percent_variance = (pca$sdev^2/sum(pca$sdev^2)*100), dimension = row.names(.))

pca_df_normalised%>%
  ggplot(aes(PC1, PC2, col=replicate_group, shape = pipeline))+
  geom_point(size = 3)+
  labs(title = "Principal component analysis of samples", subtitle = "Excluding decoys, excluding contaminants", x = paste("PC1", "(", round(pca_sdev_df$percent_variance[pca_sdev_df$dimension == 1],1), "%)"), y = paste("PC2", "(", round(pca_sdev_df$percent_variance[pca_sdev_df$dimension == 2],1), "%)"), color = "Experiment", shape = "Pipeline")+
  geom_text_repel(aes(label = paste(str_replace_all(as.character(experiment), fixed("_"), " ")),  color = replicate_group), size = 4, show.legend = FALSE)+
  theme_bw()

Correlation normalised samples

Also for the correlation no large changes should be observed if intensities were in the same range for all samples. The clustering of replicates should improve slightly after normalisation or stay the same.

correlation_normalised <- normalised %>% 
  filter(!is.na(normalised_intensity))%>%
  distinct(experiment, normalised_intensity, sequence)%>%
  spread(experiment, normalised_intensity)%>%
  column_to_rownames('sequence')%>%
  drop_na()

cor(correlation_normalised, method = "spearman")%>%
  pheatmap()

Analysis of missingness

The distribution of missing values in an experiment is an important indicator for the quality of the experiment. If match between runs was used that amount of missing values is expected to be lower.

The plots display how many missing values there are per sample (separated by LiP and tryptic control) and how missing values are distributed in the first 1000 observations (not separated by Lip and tryptic control). This allows an assessment of potential systematic missingness in some of the samples. It is for example normal that many peptides from the trypsin control are not overlapping with LiP experiments and vice versa.

missing_per_sample_lip <- normalised %>%
  ungroup()%>%
  filter(pipeline == "LiP")%>%
  distinct(experiment, sequence, normalised_intensity_log2, pipeline)%>%
  complete(experiment, sequence, pipeline)%>%
  group_by(experiment)%>%
  summarize(frequency = sum(is.na(normalised_intensity_log2))/n()*100, pipeline = unique(pipeline))

missing_per_sample <- normalised %>%
  ungroup()%>%
  filter(pipeline == "Trypsin Digest")%>%
  distinct(experiment, sequence, normalised_intensity_log2, pipeline)%>%
  complete(experiment, sequence, pipeline)%>%
  group_by(experiment)%>%
  summarize(frequency = sum(is.na(normalised_intensity_log2))/n()*100, pipeline = unique(pipeline))%>%
  bind_rows(missing_per_sample_lip)%>%
  mutate(experiment = as.factor(experiment))

missing_per_sample_plot <- missing_per_sample %>%
  ggplot(aes(paste(str_replace_all(as.character(experiment), fixed("_"), " ")), frequency))+
  geom_col(fill = "seashell3", col = "black")+
  facet_wrap(~pipeline, scales = "free_x")+
  labs(title = "Missing values of normalised intensity for each sample and pipeline", x = "", y = "Frequency of missing values [%]")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust =1, size =8))

ggplotly(missing_per_sample_plot)
vis_missing <- normalised %>%
  distinct(experiment, sequence, normalised_intensity_log2)%>%
  complete(experiment, sequence)%>%
  spread(experiment, normalised_intensity_log2)%>%
  select(-sequence)%>%
  slice(1:1000)

vis_miss(vis_missing, cluster = TRUE)+
  theme_bw()+
  labs(title = "Missing values of normalised intensity for each sample (first 1000 observations)")+
  theme(axis.text.x = element_text(angle = 45, hjust=0))

Protein intensity calculation

The protein intensity is computed using the MaxLFQ algorythm implemented into R by the iq package. It is based on MaxLFQ that uses peptide ration information to calculate protein abundances. It can be found as a feature in Maxquant. Only protein intensities for proteins with more than one precursor in a respective run are computed.

The result of the MaxLFQ calculation by iq can be examined using the “MaxLFQ Explorer” shiny app. It can be found in the same repository as this script.

split_tibble <- function(tibble, col = 'col'){
  tibble %>% split(., .[,col])
}

transform_list <- function(x, .pb=NULL){
  if ((!is.null(.pb)) && inherits(.pb, "Progress") && (.pb$i < .pb$n)) .pb$tick()$print()
  
  x%>%
    select(-leading_razor_protein)%>%
    pivot_wider(names_from = experiment, values_from = normalised_precursor_intensity_log2)%>%
    column_to_rownames("precursor")%>%
    as.matrix()
}

sample_names <- sort(unique(complete_clean$experiment))

pb <- progress_estimated(length(unique(normalised$leading_razor_protein)))

normalised_list <- normalised %>%
  ungroup()%>%
  filter(type != "MSMS")%>%
  unite("precursor", modified_sequence, charge, sep = "")%>%
  select(experiment, precursor, normalised_precursor_intensity)%>%
  group_by(experiment, precursor)%>%
  summarize(normalised_precursor_intensity_log2 = log2(sum(normalised_precursor_intensity)))%>%
  ungroup()%>%
  complete(experiment, precursor)%>%
  left_join(normalised%>%ungroup()%>%unite("precursor", modified_sequence, charge, sep = "")%>%distinct(precursor, leading_razor_protein), by = "precursor")%>%
  split_tibble("leading_razor_protein")%>%
  map(transform_list)


maxlfq <- normalised_list %>%
  map(iq::maxLFQ)%>%
  modify_depth(.depth = 1, .f = pluck, "estimate")%>%
  map(function(x) matrix(x, ncol = length(x), nrow = 1, dimnames = list("MaxLFQ_iq", sample_names)))%>%
  map(as_tibble, rownames = NA)

protint <- normalised_list %>%
  map(as_tibble, rownames = NA)%>%
  map2(maxlfq, rbind)%>%
  map(rownames_to_column, "precursor")%>%
  map(pivot_longer, -precursor, names_to = "experiment", values_to = "intensity")

maxlfq_tibble <- maxlfq %>%
  map(pivot_longer, everything(), names_to = "experiment", values_to = "protein_intensity_log2")%>%
  map_dfr(bind_rows, .id = "leading_razor_protein")

complete_normalised <- normalised %>%
  mutate(precursor = paste0(modified_sequence, charge))%>%
  select(experiment, leading_razor_protein, precursor, normalised_precursor_intensity_log2)%>%
  left_join(maxlfq_tibble, by = c("experiment", "leading_razor_protein"))%>%
  group_by(experiment, leading_razor_protein)%>%
  filter(!is.na(normalised_precursor_intensity_log2))%>%
  mutate(n_precursor = n_distinct(precursor))%>%
  filter(n_precursor > 1)%>%
  distinct(experiment, leading_razor_protein, protein_intensity_log2)%>%
  right_join(normalised, by = c("experiment", "leading_razor_protein"))
fwrite(complete_normalised, "complete_normalised.txt", quote = FALSE, sep = "\t", row.names = FALSE, na = "NA")